This extended abstract describes and analyses a near-optimal probabilistic algorithm, HyperLogLog, dedicated to 
estimating the number of distinct elements (the cardinality) of very large data ensembles. Using an auxiliary memory 
of m units (typically, "short bytes"), HyperLogLog performs a single pass over the data and produces an estimate 
of the cardinality such that the relative accuracy (the standard error) is typically about $1.04/\sqrt{m}$. This improves on 
the best previously known cardinality estimator, LogLog, whose accuracy can be matched by consuming only 64% 
of the original memory. For instance, the new algorithm makes it possible to estimate cardinalities well beyond $10^9$ 
with a typical accuracy of 2% while using a memory of only 1.5 kilobytes. The algorithm parallelizes optimally and 
adapts to the sliding window model. 



Introduction 

The purpose of this note is to present and analyse an efficient algorithm for estimating the number of 
distinct elements, known as the cardinality, of large data ensembles, which are referred to here as multisets 
and are usually massive streams (read-once sequences). This problem has received a great deal of attention 
over the past two decades, finding an ever growing number of applications in networking and traffic 
monitoring, such as the detection of worm propagation, of network attacks (e.g., by Denial of Service), 
and of link-based spam on the web [3]. For instance, a data stream over a network consists of a sequence 
of packets, each packet having a header, which contains a pair (source-destination) of addresses, followed 
by a body of specific data; the number of distinct header pairs (the cardinality of the multiset) in various 
time slices is an important indication for detecting attacks and monitoring traffic, as it records the number 
of distinct active flows. Indeed, worms and viruses typically propagate by opening a large number of 
different connections, and though they may well pass unnoticed amongst a huge traffic, their activity 
becomes exposed once cardinalities are measured (see the lucid exposition by Estan and Varghese in [11]). 
Other applications of cardinality estimators include data mining of massive data sets of sorts: natural 
language texts [5], biological data [17, 18], very large structured databases, or the internet graph, where 
the authors of [22] report computational gains by a factor of 500+ attained by probabilistic cardinality 
estimators. 
Clearly, the cardinality of a multiset can be exactly determined with a storage complexity essentially 
proportional to its number of elements. However, in most applications, the multiset to be treated is far too 
large to be kept in core memory. A crucial idea is then to relax the constraint of computing the value n of 
the cardinality exactly, and to develop probabilistic algorithms dedicated to estimating n approximately. 
(In many practical applications, a tolerance of a few percent on the result is acceptable.) A whole range 
of algorithms has been developed that only require a sublinear memory [2, 10, 11, 15, 16], or, at worst, 
a linear memory, but with a small implied constant [24]. 

All known efficient cardinality estimators rely on randomization, which is ensured by the use of hash 
functions. The elements to be counted belonging to a certain data domain $\mathcal{D}$, we assume given a hash 
function, $h : \mathcal{D} \to \{0,1\}^\infty$; that is, we assimilate hashed values to infinite binary strings of $\{0,1\}^\infty$, or 
equivalently to real numbers of the unit interval. (In practice, hashing on 32 bits will suffice to estimate 
cardinalities in excess of $10^9$; see Section 4 for a discussion.) We postulate that the hash function has been 
designed in such a way that the hashed values closely resemble a uniform model of randomness, namely, 
bits of hashed values are assumed to be independent and to have each probability 1/2 of occurring; 
practical methods are known [20], which vindicate this assumption, based on cyclic redundancy codes 
(CRC), modular arithmetic, or a simplified cryptographic use of boolean algebra (e.g., SHA-1). 

The best known cardinality estimators rely on making suitable, concise enough, observations on the 
hashed values $h(\mathcal{M})$ of the input multiset $\mathcal{M}$, then inferring a plausible estimate of the unknown 
cardinality n. Define an observable of a multiset $S = h(\mathcal{M})$ of $\{0,1\}^\infty$ strings (or, equivalently, of real [0, 1] 
numbers) to be a function that only depends on the set underlying S, that is, a quantity independent of 
replications. Then two broad categories of cardinality observables have been studied. 

- Bit-pattern observables: these are based on certain patterns of bits occurring at the beginning of 
the (binary) S-values. For instance, observing in the stream S at the beginning of a string a bit-pattern 
$0^{\rho-1}1$ is more or less a likely indication that the cardinality n of S is at least $2^\rho$. The 
algorithms known as Probabilistic Counting, due to Flajolet-Martin [15], together with the more 
recent LogLog of Durand-Flajolet [10] belong to this category. 

- Order statistics observables: these are based on order statistics, like the smallest (real) values, that 
appear in S. For instance, if $X = \min(S)$, we may legitimately hope that n is roughly of the order 
of $1/X$, since, as regards expectations, one has $\mathbb{E}(X) = 1/(n+1)$. The algorithms of Bar-Yossef 
et al. and Giroire's MinCount [16, 18] are of this type. 

The observables just described can be maintained with just one or a few registers. However, as such, 
they only provide a rough indication of the sought cardinality n, via $\log_2 n$ or $1/n$. One difficulty is due 
to a rather high variability, so that one observation, corresponding to the maintenance of a single variable, 
cannot suffice to obtain accurate predictions. An immediate idea is then to perform several experiments 
in parallel: if each of a collection of m random variables has standard deviation $\sigma$, then their arithmetic 
mean has standard deviation $\sigma/\sqrt{m}$, which can be made as small as we please by increasing m. That 
simplistic strategy has however two major drawbacks: it is costly in terms of computation time (we would 
need to compute m hashed values per element scanned), and, worse, it would necessitate a large set of 
independent hashing functions, for which no construction is known. 

The solution introduced in [15], under the name of stochastic averaging, consists in emulating the 
effect of m experiments with a single hash function. Roughly speaking, we divide the input stream 
$h(\mathcal{M})$ into m substreams, corresponding to a partition of the unit interval of hashed values into 
$[0, \frac{1}{m}[$, $[\frac{1}{m}, \frac{2}{m}[$, ..., $[\frac{m-1}{m}, 1]$. Then, one maintains the m observables $O_1, \ldots, O_m$ corresponding to each of 
the m substreams. A suitable average of the $\{O_j\}$ is then expected to produce an estimate of cardinalities 
whose quality should improve, due to averaging effects, in proportion to $1/\sqrt{m}$, as m increases. The 
benefit of this approach is that it requires only a constant number of elementary operations per element of 
the multiset $\mathcal{M}$ (as opposed to a quantity proportional to m), while only one hash function is now needed. 

Fig. 1: A comparison of major cardinality estimators for cardinalities $\le N$: (i) algorithms; (ii) memory complexity 
and units (for $N \le 10^9$); (iii) relative accuracy. 

The performances of several algorithms are compared in Figure 1; see also [13] for a review. 
HyperLogLog, described in detail in the next section, is based on the same observable as LogLog, namely 
the largest $\rho$ value obtained, where $\rho(x)$ is the position of the leftmost 1-bit in binary string x. Stochastic 
averaging in the sense above is employed. However, our algorithm differs from standard LogLog by its 
evaluation function: it is based on harmonic means, while the standard algorithm uses what amounts to 
a geometric mean¹. The idea of using harmonic means originally drew its inspiration from an insightful 
note of Chassaing and Gerin [6]: such means have the effect of taming probability distributions with 
slow-decaying right tails, and here they operate as a variance reduction device, thereby appreciably 
increasing the quality of estimates. Theorem 1 below summarizes our main conclusions to the effect that the 
relative accuracy (technically, the standard error) of HyperLogLog is numerically close to $\beta_\infty/\sqrt{m}$, 
where $\beta_\infty = \sqrt{3\log 2 - 1} \doteq 1.03896$. The algorithm needs to maintain a collection of registers, each 
of which is at most $\log_2 \log_2 N + O(1)$ bits, when cardinalities $\le N$ need to be estimated. In particular, 
HyperLogLog achieves an accuracy matching that of standard LogLog by consuming only 64% of 
the corresponding memory. As a consequence, using m = 2048, hashing on 32 bits, and short bytes of 5 
bit length each: cardinalities up to values over $N = 10^9$ can be estimated with a typical accuracy of 2% 
using 1.5kB (kilobyte) of storage. 
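
As a quick check of the figures quoted above, the following sketch evaluates the standard error $1.04/\sqrt{m}$ and the register storage for m = 2048 five-bit registers. The function name `hll_budget` and its parameters are illustrative assumptions, not part of the paper.

```python
import math


def hll_budget(b: int = 11, hash_bits: int = 32, register_bits: int = 5):
    """Return (standard error, register bytes, largest register value)."""
    m = 1 << b                               # m = 2^b registers
    std_error = 1.04 / math.sqrt(m)          # typical relative accuracy
    memory_bytes = m * register_bits / 8     # register storage only
    max_register = hash_bits - b + 1         # rho of an all-zero remainder
    return std_error, memory_bytes, max_register


if __name__ == "__main__":
    err, mem, top = hll_budget()
    print(f"standard error ~ {err:.4f}, registers: {mem:.0f} bytes, max value {top}")
```

With the defaults, the standard error is close to 2.3%, the registers occupy 1280 bytes, and the largest value a register may hold (22) fits in 5 bits.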

The proofs base themselves in part on techniques that are now standard in analysis of algorithms, 
like poissonization, Mellin transforms, and saddle-point depoissonization. Some nonstandard problems 
however present themselves due to the special nonlinear character of harmonic means, so that several 
ingredients of the analysis are not completely routine. 

1 The HyperLogLog algorithm 

The HyperLogLog algorithm is fully specified in Figure 2, the corresponding program being discussed 
later, in Section 4. The input is a multiset $\mathcal{M}$ of data items, that is, a stream whose elements are read 
sequentially. The output is an estimate of the cardinality, defined as the number of distinct elements 

1 The paper [10] also introduces a variant called SuperLogLog, which attempts to achieve variance reduction by censoring 
extreme data. It has however the disadvantage of not being readily amenable to analysis, as regards bias and standard error. 
Let $h : \mathcal{D} \to [0,1] \equiv \{0,1\}^\infty$ hash data from domain $\mathcal{D}$ to the binary domain. 
Let $\rho(s)$, for $s \in \{0,1\}^\infty$, be the position of the leftmost 1-bit ($\rho(0001\cdots) = 4$). 

Algorithm HyperLogLog (input $\mathcal{M}$ : multiset of items from domain $\mathcal{D}$). 
assume $m = 2^b$ with $b \in \mathbb{Z}_{>0}$; 
initialize a collection of m registers, M[1], ..., M[m], to $-\infty$; 

for $v \in \mathcal{M}$ do 
    set x := h(v); 
    set $j = 1 + \langle x_1 x_2 \cdots x_b \rangle_2$; {the binary address determined by the first b bits of x} 
    set $w := x_{b+1} x_{b+2} \cdots$; set $M[j] := \max(M[j], \rho(w))$; 

compute $Z := \left( \sum_{j=1}^{m} 2^{-M[j]} \right)^{-1}$; {the "indicator" function} 
return $E := \alpha_m m^2 Z$ with $\alpha_m$ as given by Equation (3). 

Fig. 2: The HyperLogLog Algorithm. 



in $\mathcal{M}$. A suitable hash function h has been fixed. The algorithm relies on a specific bit-pattern observable 
in conjunction with stochastic averaging. Given a string $s \in \{0,1\}^\infty$, let $\rho(s)$ represent the position 
of the leftmost 1 (equivalently one plus the length of the initial run of 0's). The stream $\mathcal{M}$ is split into 
substreams $\mathcal{M}_1, \ldots, \mathcal{M}_m$, based on the first b bits of hashed values² of items, where $m = 2^b$, and each 
substream is processed independently. For $\mathcal{N} \equiv \mathcal{M}_j$ such a substream (regarded as composed of hashed 
values stripped of their initial b bits), the corresponding observable is then 

$$\operatorname{Max}(\mathcal{N}) := \max_{x \in \mathcal{N}} \rho(x), \tag{1}$$

with the convention that $\operatorname{Max}(\emptyset) = -\infty$. The algorithm gathers on the fly (in registers M[j]) the values 
$M^{(j)}$ of $\operatorname{Max}(\mathcal{M}_j)$ for $j = 1, \ldots, m$. Once all the elements have been scanned, the algorithm computes 
the indicator, 

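$$Z := \left( \sum_{j=1}^{m} 2^{-M^{(j)}} \right)^{-1}. \tag{2}$$

It then returns a normalized version of the harmonic mean of the $2^{M^{(j)}}$ in the form, 

$$E := \frac{\alpha_m m^2}{\sum_{j=1}^{m} 2^{-M^{(j)}}}, \quad \text{with} \quad \alpha_m := \left( m \int_0^\infty \left( \log_2 \left( \frac{2+u}{1+u} \right) \right)^m du \right)^{-1}. \tag{3}$$
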

Here is the intuition underlying the algorithm. Let n be the unknown cardinality of $\mathcal{M}$. Each substream 
will comprise approximately n/m elements. Then, its Max-parameter should be close to $\log_2(n/m)$. The 
harmonic mean (mZ in our notations) of the quantities $2^{\mathrm{Max}}$ is then likely to be of the order of n/m. 
Thus, $m^2 Z$ should be of the order of n. The constant $\alpha_m$, provided by our subsequent analysis, is finally 
introduced so as to correct a systematic multiplicative bias present in $m^2 Z$. 
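
The intuition can be checked empirically under an idealized model. The sketch below assumes that hashed values are simulated by Python's `random.Random` generator rather than produced by a real hash function; it draws n uniform reals, applies stochastic averaging, and returns $\alpha_m m^2 Z$. The helper names (`rho_real`, `ideal_hyperloglog`) are hypothetical, and the approximation $\alpha_m \approx 0.7213/(1 + 1.079/m)$, which the paper's practical program quotes for larger m, is assumed adequate for this illustration.

```python
import math
import random


def rho_real(w: float) -> int:
    """Position of the leftmost 1-bit in the binary expansion of w in (0, 1)."""
    # frexp gives w = mant * 2**e with 0.5 <= mant < 1, so the first 1-bit
    # sits at position 1 - e (e.g. w = 0.0001..._2 has e = -3, position 4).
    return 1 - math.frexp(w)[1]


def ideal_hyperloglog(n: int, b: int = 8, seed: int = 1) -> float:
    """Estimate n from n simulated uniform hashed values, with m = 2**b registers."""
    m = 1 << b
    rng = random.Random(seed)
    registers = [-math.inf] * m              # registers start at -infinity
    for _ in range(n):
        x = rng.random()                     # simulated hashed value in [0, 1)
        j = int(x * m)                       # substream index (first b bits)
        w = x * m - j                        # remaining bits, rescaled to [0, 1)
        if w > 0.0:
            registers[j] = max(registers[j], rho_real(w))
    # 2**(+inf) is +inf, so Z = 0 whenever a register stayed untouched.
    z = 1.0 / sum(2.0 ** (-r) for r in registers)
    alpha_m = 0.7213 / (1 + 1.079 / m)       # approximation, assumed adequate here
    return alpha_m * m * m * z
```

For n well above m log m the returned value should fall within a few standard errors ($1.04/\sqrt{m}$) of n, whereas for n smaller than m at least one register stays at $-\infty$ and the estimate collapses to 0.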

Our main statement, Theorem 1 below, deals with the situation of ideal multisets: 

2 The algorithm can be adapted to cope with any integral value of $m \ge 3$, at the expense of a few additional arithmetic operations. 
Definition 1 An ideal multiset of cardinality n is a sequence obtained by arbitrary replications and 
permutations applied to n uniform identically distributed random variables over the real interval [0, 1]. 

In the analytical part of our paper (Sections 2 and 3), we postulate that the collection of hashed values 
$h(\mathcal{M})$, which the algorithm processes, constitutes an ideal multiset. This assumption is a natural way to 
model the outcome of well designed hash functions. Note that the number of distinct elements of such an 
ideal multiset equals n with probability 1. We henceforth let $\mathbb{E}_n$ and $\mathbb{V}_n$ be the expectation and variance 
operators under this model. 

Theorem 1 Let the algorithm HyperLogLog of Figure 2 be applied to an ideal multiset of (unknown) 
cardinality n, using $m \ge 3$ registers, and let E be the resulting cardinality estimate. 

(i) The estimate E is asymptotically almost unbiased in the sense that 

$$\frac{1}{n} \mathbb{E}_n(E) \underset{n \to \infty}{=} 1 + \delta_1(n) + o(1), \quad \text{where } |\delta_1(n)| < 5 \cdot 10^{-5} \text{ as soon as } m \ge 16.$$

(ii) The standard error defined as $\frac{1}{n} \sqrt{\mathbb{V}_n(E)}$ satisfies as $n \to \infty$, 

$$\frac{1}{n} \sqrt{\mathbb{V}_n(E)} \underset{n \to \infty}{=} \frac{\beta_m}{\sqrt{m}} + \delta_2(n) + o(1), \quad \text{where } |\delta_2(n)| < 5 \cdot 10^{-4} \text{ as soon as } m \ge 16,$$

the constants $\beta_m$ being bounded, with $\beta_{16} = 1.106$, $\beta_{32} = 1.070$, $\beta_{64} = 1.054$, $\beta_{128} = 1.046$, and 
$\beta_\infty = \sqrt{3\log(2) - 1} = 1.03896$. 

The standard error measures in relative terms the typical error to be observed (in a mean quadratic 
sense). The functions $\delta_1(n)$, $\delta_2(n)$ represent oscillating functions of a tiny amplitude, which are 
computable, and whose effect could in theory be at least partly compensated; they can anyhow be safely 
neglected for all practical purposes. 

Plan of the paper. The bulk of the paper is devoted to the proof of Theorem 1. We determine the 
asymptotic behaviour of $\mathbb{E}_n(Z)$ and $\mathbb{V}_n(Z)$, where Z is the indicator $1/\sum_j 2^{-M^{(j)}}$. The value of $\alpha_m$ in 
Equation (3), which makes E an asymptotically almost unbiased estimator, is derived from this analysis, 
as is the value of the standard error. The mean value analysis forms the subject of Section 2. In fact, 
the exact expression of $\mathbb{E}_n(Z)$ being hard to manage, we first "poissonize" the problem and examine 
$\mathbb{E}_{\mathcal{P}(\lambda)}(Z)$, which represents the expected value of the indicator Z when the total number of elements is not 
fixed, but rather obeys a Poisson law of parameter $\lambda$. We then prove that, asymptotically, the behaviours of 
$\mathbb{E}_n(Z)$ and $\mathbb{E}_{\mathcal{P}(\lambda)}(Z)$ are close, when one chooses $\lambda := n$: this is the depoissonization step. The variance 
analysis of the indicator Z, hence of the standard error, is sketched in Section 3 and is entirely parallel to 
the mean value analysis. Finally, Section 4 examines how to implement the HyperLogLog algorithm 
in real-life contexts, presents simulations, and discusses optimality issues. 

2 Mean value analysis 

Our starting point is the random variable Z (the "indicator") defined in (2). We recall that $\mathbb{E}_n$ refers to 
expectations under the ideal multiset model, when the (unknown) cardinality n is fixed. The analysis 
starts from the exact expression of $\mathbb{E}_n(Z)$ in Proposition 1, continues with an asymptotic analysis of the 
corresponding Poisson expectation summarized by Proposition 2, and concludes with the depoissonization 
argument of Proposition 3. 
2.1 Exact expressions 

Let $\mathcal{N}$ be an ideal multiset of cardinality $\nu$. The quantity $\operatorname{Max}(\mathcal{N}) = \max_{x \in \mathcal{N}} \rho(x)$ is the maximum of $\nu$ 
independent random variables, namely, the values $\rho(x)$. Each random variable, call it Y, is geometrically 
distributed according to $\mathbb{P}(Y \ge k) = 2^{1-k}$, for $k \ge 1$. Thus, the maximum M of $\nu$ such random 
variables satisfies $\mathbb{P}(M = k) = \left(1 - \frac{1}{2^k}\right)^\nu - \left(1 - \frac{1}{2^{k-1}}\right)^\nu$, for $\nu \ge 1$. Let now an ideal multiset of fixed 
cardinality n be split into m "submultisets" of respective (random) cardinalities $N^{(1)}, \ldots, N^{(m)}$. The 
joint law of the $N^{(j)}$ is a multinomial. The combination of the previous two observations then provides: 

Proposition 1 The expectation of the indicator Z resulting from an ideal multiset of fixed cardinality n 
satisfies 

$$\mathbb{E}_n(Z) = \sum_{k_1, \ldots, k_m \ge 1} \frac{1}{\sum_{j=1}^{m} 2^{-k_j}} \sum_{n_1 + \cdots + n_m = n} \binom{n}{n_1, \ldots, n_m} \frac{1}{m^n} \prod_{j=1}^{m} \gamma_{n_j, k_j}, \quad \text{where } \gamma_{\nu,k} = \left(1 - \frac{1}{2^k}\right)^\nu - \left(1 - \frac{1}{2^{k-1}}\right)^\nu \tag{4}$$

for $\nu, k \ge 1$ and $\gamma_{0,k} = 0$. 

Note that, under the convention that registers $M^{(j)}$ are initialized to $-\infty$, we have Z = 0 as soon as any 
of the registers has remained untouched; this explains the fact that summation in (4) only needs to be 
taken over register values $k_j \ge 1$. 

The rather formidable looking expression of (4) is to be analysed. For this purpose, we introduce the 
Poisson model, where an ideal multiset is produced with a random size N distributed according to a 
Poisson law of parameter $\lambda$: $\mathbb{P}(N = n) = e^{-\lambda} \lambda^n / n!$. Then, as shown by a simple calculation, we have: 
Under the Poisson model of rate $\lambda$, the expectation of the indicator Z satisfies, 

$$\mathbb{E}_{\mathcal{P}(\lambda)}(Z) = \sum_{k_1, \ldots, k_m \ge 1} \frac{1}{\sum_{j=1}^{m} 2^{-k_j}} \prod_{j=1}^{m} g\left( \frac{\lambda}{m 2^{k_j}} \right), \quad \text{where } g(x) = e^{-x} - e^{-2x}. \tag{5}$$

The verification is based on the relation, 

$$\mathbb{E}_{\mathcal{P}(\lambda)}(Z) = \sum_{n \ge 0} \mathbb{E}_n(Z)\, e^{-\lambda} \frac{\lambda^n}{n!},$$

and series rearrangements. (Equivalently, independence properties of Poisson flows may be used.) 
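
The "simple calculation" can be spelled out as follows. This is a sketch of one standard route, writing $\mu = \lambda/m$ (a shorthand introduced here) and using the definitions of $\gamma_{\nu,k}$ and $g$ given above: the Poisson weights turn the multinomial sum into a product of m independent sums, each of which is an exponential series.

```latex
% Poisson mixture of the multinomial sum in (4): the weights factorize.
\sum_{n\ge 0} e^{-\lambda}\frac{\lambda^n}{n!}
  \sum_{n_1+\cdots+n_m=n}\binom{n}{n_1,\ldots,n_m}\frac{1}{m^n}
  \prod_{j=1}^m \gamma_{n_j,k_j}
= \prod_{j=1}^m \Bigl(\sum_{\nu\ge 0} e^{-\mu}\frac{\mu^{\nu}}{\nu!}\,\gamma_{\nu,k_j}\Bigr),
\qquad \mu=\frac{\lambda}{m}.

% Each factor is an exponential series (the term \nu = 0 vanishes):
\sum_{\nu\ge 1} e^{-\mu}\frac{\mu^{\nu}}{\nu!}
  \Bigl[(1-2^{-k})^{\nu}-(1-2^{1-k})^{\nu}\Bigr]
= e^{-\mu}\Bigl[e^{\mu(1-2^{-k})}-e^{\mu(1-2^{1-k})}\Bigr]
= e^{-\mu 2^{-k}}-e^{-2\mu 2^{-k}}
= g\!\left(\frac{\mu}{2^{k}}\right).
```

Substituting each factor back, with $\mu/2^{k_j} = \lambda/(m 2^{k_j})$, gives the product of g-values appearing in (5).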

2.2 Asymptotic analysis under the Poisson model 

The purpose of this subsection is to determine the asymptotic behaviour of the Poisson expectation, 
$\mathbb{E}_{\mathcal{P}(\lambda)}(Z)$, as given by Equation (5). Our main result in this subsection is summarized in the 
following proposition: 

Proposition 2 With $\alpha_m$ as in (3), the Poisson expectation $\mathbb{E}_{\mathcal{P}(\lambda)}(Z)$ satisfies 

$$\mathbb{E}_{\mathcal{P}(\lambda)}(Z) \underset{\lambda \to \infty}{=} \frac{\lambda}{m} \left( \frac{1}{m \alpha_m} + \epsilon_m\left( \frac{\lambda}{m} \right) + o(1) \right), \quad \text{where } |\epsilon_m(t)| < 5 \cdot 10^{-5}/m \text{ as soon as } m \ge 16. \tag{6}$$

The proof of Proposition 2 consists of three steps: (i) the Poisson expectation is first expressed in 
integral form (Equation (9)); (ii) the integrand is next analysed by means of the Mellin transform (Lemma 
1); (iii) the outcome of the local Mellin analysis is finally used to estimate the Poisson expectation. 
The integral representation. The asymptotic analysis of the Poisson expectation departs from the usual 
paradigm of analysis exemplified by [10, 14, 23] because of the coupling introduced by the harmonic 
mean, namely, the factor $\left(\sum_{j=1}^{m} 2^{-k_j}\right)^{-1}$. This is remedied by a use of the simple identity 

$$\frac{1}{a} = \int_0^\infty e^{-at}\, dt, \tag{7}$$

which then leads to a crucial separation of variables, 

$$\mathbb{E}_{\mathcal{P}(mx)}(Z) = \sum_{k_1, \ldots, k_m \ge 1} \frac{1}{\sum_{j=1}^{m} 2^{-k_j}} \prod_{j=1}^{m} g\left( \frac{x}{2^{k_j}} \right) = \sum_{k_1, \ldots, k_m \ge 1} \int_0^{\infty} \prod_{j=1}^{m} g(2^{-k_j} x)\, e^{-t \sum_{j=1}^{m} 2^{-k_j}}\, dt = \int_0^{\infty} G(x,t)^m\, dt,$$

where we have set 

$$G(x,t) := \sum_{k \ge 1} g\left( \frac{x}{2^k} \right) e^{-t/2^k}. \tag{8}$$

Then the further change of variables t = xu leads to the following useful form: The Poisson expectation 
satisfies, with G(x, t) defined by (8): 

$$\mathbb{E}_{\mathcal{P}(\lambda)}(Z) = H\left( \frac{\lambda}{m} \right), \quad \text{where } H(x) := x \int_0^{+\infty} G(x, xu)^m\, du. \tag{9}$$

Analysis of the integrand. Our goal is now to analyse the integral representation (9) of Poisson averages. 
We make use of the Mellin transform, which to a function f(t) defined on $\mathbb{R}_{>0}$, associates the complex 
function 

$$f^*(s) := \int_0^{+\infty} f(t)\, t^{s-1}\, dt. \tag{10}$$

One fundamental property is that the transform of a harmonic sum, $F(x) = \sum_k \lambda_k f(\mu_k x)$, factorizes, as 
$F^*(s) = \left( \sum_k \lambda_k \mu_k^{-s} \right) f^*(s)$. Another fundamental property (devolving from the inversion formula and 
a residue calculation) is that the asymptotic behaviour of the original function, f, can be read off from the 
singularities of its transform $f^*$; see [14] for a survey. We prove: 

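Lemma 1 For each fixed u > 0, the function $x \mapsto G(x, xu)$ has the following asymptotic behaviour as 
$x \to +\infty$, 

$$G(x, xu) = \begin{cases} f(u)\left(1 + O(x^{-1})\right) + u\,\epsilon(x,u) & \text{if } u \le 1 \\ f(u)\left(1 + 2\tilde{\epsilon}(x,u) + O(x^{-1})\right) & \text{if } u > 1, \end{cases} \tag{11}$$

where $f(u) = \log_2\left( \frac{2+u}{1+u} \right)$, the O error terms are uniform in u > 0, and $|\epsilon|, |\tilde{\epsilon}| \le \epsilon_0 \approx 7 \cdot 10^{-6}$ for 
$x \ge 0$. 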

Proof: Write $h_u(x) := G(x, xu)$. This function is a harmonic sum, 

$$h_u(x) = \sum_{k=1}^{+\infty} g(2^{-k} x)\, e^{-2^{-k} x u} = \sum_{k=1}^{+\infty} q(x 2^{-k}), \quad \text{with } q(x) := g(x)\, e^{-xu},$$

so that its Mellin transform factorizes, in the fundamental strip $\langle -1, 0 \rangle$, as 

$$h_u^*(s) = \left( \sum_{k=1}^{+\infty} 2^{ks} \right) q^*(s) = \frac{2^s}{1 - 2^s}\, \Gamma(s) \left( (1+u)^{-s} - (2+u)^{-s} \right), \tag{12}$$

where $\Gamma(s)$ is the Euler Gamma function. The asymptotic behaviour of $h_u(x)$ as $x \to +\infty$ is then 
determined by the poles of $h_u^*(s)$ that lie on the right of the fundamental strip. The poles of $h_u^*(s)$ are 
at $\mathbb{Z}_{\le 0}$ (because of the $\Gamma$ factor) and at the complex values $\{\eta_k := 2ik\pi/\log(2),\ k \in \mathbb{Z}\}$, where the 
denominator $1 - 2^s$ vanishes. The Mellin inversion formula, 

$$h_u(x) = \frac{1}{2i\pi} \int_{-1/2 - i\infty}^{-1/2 + i\infty} h_u^*(s)\, x^{-s}\, ds,$$

when combined with the residue theorem, implies 

$$h_u(x) = -\sum_{k \in \mathbb{Z}} \operatorname{Res}\left( h_u^*(s) x^{-s}, \eta_k \right) + \frac{1}{2i\pi} \int_{1 - i\infty}^{1 + i\infty} h_u^*(s)\, x^{-s}\, ds. \tag{13}$$


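Some care is to be exerted in estimations since uniformity with respect to the parameter u is required. 
The residues are given by 

$$\operatorname{Res}\left( h_u^*(s) x^{-s}, \eta_k \right) = -\frac{1}{\log 2}\, x^{-\eta_k}\, \Gamma(\eta_k) \left( (1+u)^{-\eta_k} - (2+u)^{-\eta_k} \right) \quad (k \in \mathbb{Z} \setminus \{0\}),$$

$$\operatorname{Res}\left( h_u^*(s) x^{-s}, 0 \right) = \frac{1}{\log 2} \log\left( \frac{1+u}{2+u} \right) \quad (k = 0).$$

As regards their sum in (13), we note the inequality, valid for $\Re(s) \ge 0$, 

$$\left| (1+u)^{-s} - (2+u)^{-s} \right| = \left| e^{-s \log(1+u)} - e^{-s \log(2+u)} \right| \le |s| \log\left( \frac{2+u}{1+u} \right) \tag{14}$$

(verified by writing the difference as the integral of its derivative, then bounding the derivative), and its 
companion, valid for $s = \eta_k$ (to be used for u close to 0), 

$$\left| (1+u)^{-s} - (2+u)^{-s} \right| \le \frac{u\,|s|}{2}, \tag{15}$$

verified by the strict decrease of $u \mapsto |(1+u)^{-s} - (2+u)^{-s}|/u$, whose limit at u = 0 equals $|s|/2$ since $2^{-\eta_k} = 1$. 
As a consequence, one obtains the two simultaneously valid bounds 

$$\left| \sum_{k \in \mathbb{Z}} \operatorname{Res}\left( h_u^*(s) x^{-s}, \eta_k \right) - \log_2\left( \frac{1+u}{2+u} \right) \right| \le \begin{cases} \dfrac{\pi u}{(\log 2)^2} \displaystyle\sum_{k \in \mathbb{Z},\, k \ne 0} |k\, \Gamma(\eta_k)| \\[2ex] \dfrac{2\pi}{(\log 2)^2} \log\left( \dfrac{2+u}{1+u} \right) \displaystyle\sum_{k \in \mathbb{Z},\, k \ne 0} |k\, \Gamma(\eta_k)|, \end{cases}$$

that is, numerically, with $\epsilon_0 := \frac{\pi}{(\log 2)^2} \sum_{k \ne 0} |k\, \Gamma(\eta_k)| \approx 7 \cdot 10^{-6}$, 

$$\left| \sum_{k \in \mathbb{Z}} \operatorname{Res}\left( h_u^*(s) x^{-s}, \eta_k \right) - \log_2\left( \frac{1+u}{2+u} \right) \right| \le \begin{cases} \epsilon_0\, u & \text{if } u \le 1 \\ 2 \epsilon_0 \log\left( \dfrac{2+u}{1+u} \right) & \text{if } u > 1. \end{cases} \tag{16}$$

Next, we turn to the integral remainder in (13). By the inequality of (14), and since $|2^s/(1-2^s)| \le 2$ on the line $\Re(s) = 1$, one has uniformly 

$$\left| \int_{1 - i\infty}^{1 + i\infty} h_u^*(s)\, x^{-s}\, ds \right| \le \frac{2}{x} \log\left( \frac{2+u}{1+u} \right) \int_{1 - i\infty}^{1 + i\infty} |s\, \Gamma(s)|\, |ds| = O\left( \frac{1}{x} \log\left( \frac{2+u}{1+u} \right) \right). \tag{17}$$

The two bounds (16) and (17) then justify the statement. 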



Final asymptotics of the Poisson averages. There now remains to estimate the function H(x) of (9), 
with Lemma 1 providing precise information on the integrand. Accordingly, we decompose the domain 
of the integral expressing H(x), 

$$\frac{1}{x} H(x) = \int_0^1 \left( f(u) + u\,\epsilon(x,u) \right)^m du + \int_1^{+\infty} f(u)^m \left( 1 + 2\tilde{\epsilon}(x,u) \right)^m du + o(1) = A + B + o(1),$$

with $A = \int_0^1$ and $B = \int_1^{+\infty}$. Here, like before, we have set $f(u) := \log_2\left( \frac{2+u}{1+u} \right)$. 
We first estimate A by means of the inequality $f(u) \le 1 - 2u/5$, for $u \in [0, 1]$: 

$$\left| A - \int_0^1 f(u)^m\, du \right| \le \sum_{k=1}^{m} \binom{m}{k} \int_0^1 (1 - 2u/5)^{m-k} (u \epsilon_0)^k\, du = \int_0^1 \left[ (1 - 2u/5 + u\epsilon_0)^m - (1 - 2u/5)^m \right] du \le \frac{1}{m+1} \left( a(\epsilon_0) - a(0) \right), \quad \text{where } a(v) = \frac{1}{2/5 - v}.$$

Thus, upon bounding $a'(v)$ near 0, one finds: 

$$\left| A - \int_0^1 f(u)^m\, du \right| \le \frac{6.26\, \epsilon_0}{m + 1}. \tag{18}$$

As regards the estimation of B, it suffices to note that, for $u \ge 1$, one has $f(u) \le 1/((1+u) \log 2)$ in 
order to get directly: 

$$\left| B - \int_1^{+\infty} f(u)^m\, du \right| \le \left( (1 + 2\epsilon_0)^m - 1 \right) \frac{1}{m-1} \cdot \frac{2}{(2 \log 2)^m}. \tag{19}$$

The combination of (18) and (19) finally gives us 

$$H(x) = x \left( \int_0^{+\infty} f(u)^m\, du + \epsilon_m(x) + o(1) \right), \tag{20}$$

where $|\epsilon_m(x)| < 5 \cdot 10^{-5}/m$ (for $m \ge 16$). This last estimate applied to the expression (9) of Poisson 
averages then concludes the proof of Proposition 2. 
2.3 Analysis under the fixed-size model (depoissonization) 

We can now conclude the average-case analysis of the main indicator Z by showing that the asymptotic 
approximation derived for the Poisson model (Proposition 2) applies to the fixed size model, up to 
negligible error terms. To this aim, we appeal to a technique known as "analytic depoissonization", pioneered 
by Jacquet and Szpankowski (see [19] and [23, p. 456]) and based on the saddle point method. To wit: 

Theorem (Analytic depoissonization). Let $f(z) = e^{-z} \sum_k f_k z^k / k!$ be the Poisson generating function 
of a sequence $(f_k)$. Assume f(z) to be entire. Assume also that there exists a cone $S_\theta = \{ z \mid z = 
re^{i\phi},\ |\phi| \le \theta \}$ for some $\theta < \frac{\pi}{2}$ and a real number $\alpha < 1$ such that the following two conditions are 
satisfied, as $|z| \to \infty$: 

C1: for $z \in S_\theta$, one has $|f(z)| = O(|z|)$, 

C2: for $z \notin S_\theta$, one has $|f(z) e^z| = O(e^{\alpha |z|})$. 

Then $f_n = f(n) + O(1)$. 

The use of this theorem amounts to estimating Poisson averages (the quantity f(z)), when the Poisson 
rate $\lambda = z$ is allowed to vary in the complex plane, in which case it provides a way to return asymptotically 
to fixed-size estimates (the sequence $f_n$). The Mellin technology turns out to be robust enough to allow 
for such a method to be used. 

Proposition 3 The expectation of the mean value of the HyperLogLog indicator Z applied to a multiset 
of fixed cardinality n satisfies asymptotically as $n \to \infty$ 

$$\mathbb{E}_n(Z) = \mathbb{E}_{\mathcal{P}(n)}(Z) + O(1).$$

Proof: We apply analytic depoissonization to the integral expression which we repeat here: 

$$\mathbb{E}_{\mathcal{P}(\lambda)}(Z) = H\left( \frac{\lambda}{m} \right), \quad \text{where } H(x) := \int_0^{+\infty} G(x,t)^m\, dt = x \int_0^{+\infty} G(x, xu)^m\, du. \tag{21}$$

We shall check the conditions C1, C2 of analytic depoissonization, for $f(z) := H(z/m)$, choosing the 
half-angle of the cone to be $\theta = \pi/3$ and $\alpha = 3/5$. 

Inside the cone (Condition C1). It is sufficient to establish that $H(z) = O(|z|)$. The bounds are 
obtained via the second integral form of (21) by suitably revisiting the Mellin analysis of Subsection 2.2. 
Start from Equation (13), which expresses the Poisson expectation as a sum of residues, plus a remainder 
integral. When x is assigned a complex value $z = re^{i\phi}$, the quantity $z^{-s}$ satisfies, for $s = \sigma + i\tau$: 

$$z^{-s} = r^{-\sigma} e^{\tau \phi} \cdot e^{-i(\tau \log r + \sigma \phi)}.$$

There, the second factor has modulus equal to 1, while the first remains $O(r^{-\sigma} e^{\pi |\tau| / 3})$ within the cone. 
Then, given the fast decay of $\Gamma(s)$ towards $\pm i\infty$, the series of residues in (13) is still bounded and the 
remainder integral is itself O(1). Thus the main estimate stated in Lemma 1 and relative to $h_u(x) = 
G(x, xu)$ remains valid for x = z within the cone, and the argument can be extended to show that the 
asymptotic form of H(x) also holds (only the numerical bounds on the amplitude of the fluctuations need 
to be weakened). As a consequence, one has $H(z) = O(|z|)$, hence $H(z/m) = O(|z|)$ within the cone. 
Outside the cone (Condition C2). Start from the first integral in (21). We subdivide the domain 
according to $\Re(z) < 0$ and $\Re(z) \ge 0$. For the case $\Re(z) < 0$, we set $L := \lfloor \log_2(|z|/m) \rfloor$. The definition of G 
in (8) gives 



G 



k<L 



m2 



/?? 



2 k 



so that 



G 



(->*)! - J2 2 \ e ~ z/m \ e ' t/2L + C J2 



k<L 



k>L 



Z 

1 m2 k 



-t/2 k 



for some C > 0. Hence, raising to the mth power, we get 



G 



Vm / 



< 4 m £ 



e - 2 g- m */l z l 



\k>i J 



where use has been made of the inequality $(X + Y)^m \le 2^m X^m + 2^m Y^m$, valid for arbitrary $X, Y \ge 0$. 
Now, since the integral 



r — 

1 m ■ — 

converges, we obtain, upon integrating, 



(22) 



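$$\left| H\left( \frac{z}{m} \right) \right| \le \int_0^{+\infty} \left| G\left( \frac{z}{m}, t \right) \right|^m dt \le A \left| e^{-z}\, z \log_2^m |z| \right| + B\, |z^m|,$$

where A and B are constants (depending on m). Consequently, for $\Re(z) < 0$, we have $|e^z H(z/m)| = 
O(e^{\alpha |z|})$ for any $\alpha > 0$, and in particular we may adopt $\alpha = \frac{3}{5}$. 
For the case $\Re(z) \ge 0$, it suffices to note that 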



Vm 



2 fe 



-t/2 fc 



Raising to the mth power and integrating, we get 



H 



(") 

Vm/ 



< 



m 



dt < \z\ m T m , 



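with the integral defined in (22), so that 

$$\left| e^z H\left( \frac{z}{m} \right) \right| = O\left( |e^z z^m| \right) = O\left( e^{|z|/2} |z|^m \right) = O\left( e^{3|z|/5} \right),$$

since we have $\Re(z) \le |z|/2$ outside the cone. This last inequality completes the proof of the statement. 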



The proof of the unbiased character of HyperLogLog, corresponding to Part (i) of our main 
Theorem 1, is thus essentially complete: it suffices to combine Propositions 2 and 3, giving the asymptotic 
estimation of the expectation $\mathbb{E}_n(Z)$ of the indicator, in order to get the expected value of the estimator 
$\mathbb{E}_n(E) \sim n$ via a simple normalization by the constant factor $m^2 \alpha_m$. 
3 Variance and other stories 

3.1 Variance analysis 

The estimation of the variance of the indicator, namely $\mathbb{V}_n(Z) = \mathbb{E}_n(Z^2) - \mathbb{E}_n(Z)^2$, serves to justify 
Part (ii) of our main Theorem 1 and hence characterizes the accuracy of HyperLogLog. Since the 
analysis develops along lines that are entirely parallel to those of Section 2, we content ourselves with a 
brief indication of the main steps of the proof. 

The starting point is an expression of the moment of order 2 of the indicator Z under the Poisson model, 

$$\mathbb{E}_{\mathcal{P}(\lambda)}(Z^2) = \sum_{k_1, \ldots, k_m \ge 1} \left( \frac{1}{\sum_{j=1}^{m} 2^{-k_j}} \right)^2 \prod_{j=1}^{m} g\left( \frac{\lambda}{m 2^{k_j}} \right),$$

which is the analogue of (5). Then, the use of the identity 

$$\frac{1}{a^2} = \int_0^\infty t\, e^{-at}\, dt$$

leads to the integral form (compare with (9)) 

$$\mathbb{E}_{\mathcal{P}(\lambda)}(Z^2) = K\left( \frac{\lambda}{m} \right), \quad \text{where } K(x) := x^2 \int_0^{+\infty} u\, G(x, xu)^m\, du. \tag{24}$$

The integral being very close to the one that represents $\mathbb{E}_{\mathcal{P}(\lambda)}(Z)$, the analysis of the integrand available 
from Lemma 1 can be entirely recycled. We then find 

$$K(x) = x^2 \left( \int_0^{+\infty} u f(u)^m\, du + \epsilon'_m(x) + o(1) \right), \tag{25}$$

where $\epsilon'_m(x)$ is a small oscillating function, implying 

$$\mathbb{V}_{\mathcal{P}(\lambda)}(Z) = \frac{\lambda^2}{m^2} \left( \int_0^{+\infty} u f(u)^m\, du - \left( \int_0^{+\infty} f(u)^m\, du \right)^2 + \epsilon''_m(\lambda) + o(1) \right) \tag{26}$$

(with $\epsilon''_m$ small), which constitutes the analogue of Proposition 2. 

The last estimate (26) can then be subjected to depoissonization (with a proof similar to that of 
Proposition 3), to the effect that 

$$\mathbb{V}_n(Z) = \mathbb{V}_{\mathcal{P}(n)}(Z) + O(n). \tag{27}$$

This shows that the standard error, measured by $\frac{1}{n} \sqrt{\mathbb{V}_n(E)}$, is, for each fixed m, asymptotic to a constant 
as $n \to \infty$, neglecting as we may tiny fluctuations. The stronger property that this constant is of the form 
$\beta_m / \sqrt{m}$ (with $\beta_m$ bounded) is established in the next subsection. 



3.2 Constants 

There only remains to discuss the proportionality constants that determine the shapes of the bias-correction 
constant $\alpha_m$ specified in (3) and of the standard-error constant $\beta_m$ of the statement of Theorem 1. Define 
the special integrals 

$$J_s(m) := \int_0^{+\infty} u^s f(u)^m\, du.$$

We have from Equations (3) and (26): 

$$\alpha_m = \frac{1}{m J_0(m)}, \qquad \beta_m = \sqrt{m}\, \sqrt{\frac{J_1(m)}{J_0(m)^2} - 1}.$$

The integrals $J_s(m)$ are routinely amenable to the Laplace method: 

$$J_0(m) = \frac{2 \log 2}{m} \left( 1 + \frac{1}{m} (3 \log 2 - 1) + O(m^{-2}) \right), \qquad J_1(m) = \frac{(2 \log 2)^2}{m^2} \left( 1 + \frac{3}{m} (3 \log 2 - 1) + O(m^{-2}) \right). \tag{28}$$

Thus the bias correction $\alpha_m$ and the variance constant $\beta_m$ satisfy 

$$\alpha_m \sim \frac{1}{2 \log 2} \doteq 0.72134, \qquad \beta_m \sim \sqrt{3 \log 2 - 1} \doteq 1.03896 \qquad (m \to +\infty), \tag{29}$$

which turn out to provide good numerical approximations, even for relatively low values of m. These 
estimates imply in particular that $\beta_m$ remains bounded for all $m \ge 3$, which concludes the proof of 
Theorem 1. 

Additionally, we observe that the constants $\alpha_m$, $\beta_m$ belong to an interesting arithmetic class: the 
integrals $J_0(m)$, $J_1(m)$ are expressible as rational combinations of $L = \log 2$, values of the Riemann 
zeta function at the integers, and polylogarithms evaluated at $\frac{1}{2}$. For instance: $J_0(2) = \frac{\pi^2}{6L^2} - 2$, 
$J_0(3) = \frac{3\zeta(3)}{4L^3} - 2$, and 

$$J_0(4) = \frac{1}{L^4} \left( \frac{4}{15} \pi^4 + \pi^2 L^2 - 3L^4 - 21 L \zeta(3) - 24 \operatorname{Li}_4\left( \tfrac{1}{2} \right) \right), \quad \text{where } \operatorname{Li}_r(z) := \sum_{n \ge 1} \frac{z^n}{n^r}.$$

They can thereby be computed to great accuracy. 
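
The constants can also be evaluated by direct numerical quadrature. The sketch below relies on the substitution $s = 1/(1+u)$, which turns $J_0(m)$ into $\int_0^1 \log_2(1+s)^m s^{-2}\, ds$ and $J_1(m)$ into $\int_0^1 (1-s) \log_2(1+s)^m s^{-3}\, ds$, and on composite Simpson integration. The substitution, the function names, and the step count are choices made for this illustration; the expression used for $\beta_m$, namely $\sqrt{m}\sqrt{J_1(m)/J_0(m)^2 - 1}$, is the one obtained by combining the mean and variance estimates of the indicator.

```python
import math


def _simpson(func, a: float, b: float, steps: int = 20000) -> float:
    """Composite Simpson rule on [a, b]; steps must be even."""
    h = (b - a) / steps
    total = func(a) + func(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * func(a + i * h)
    return total * h / 3


def j_integral(s_power: int, m: int) -> float:
    """J_s(m) = int_0^inf u^s f(u)^m du for s in {0, 1}, after s = 1/(1+u)."""
    def integrand(s: float) -> float:
        if s == 0.0:
            return 0.0                       # limit value, valid for m >= 4
        core = (math.log1p(s) / math.log(2)) ** m
        return core / s**2 if s_power == 0 else (1 - s) * core / s**3
    return _simpson(integrand, 0.0, 1.0)


def alpha(m: int) -> float:
    """Bias-correction constant alpha_m = 1 / (m J_0(m))."""
    return 1.0 / (m * j_integral(0, m))


def beta(m: int) -> float:
    """Standard-error constant beta_m = sqrt(m) * sqrt(J_1 / J_0^2 - 1)."""
    j0, j1 = j_integral(0, m), j_integral(1, m)
    return math.sqrt(m) * math.sqrt(j1 / j0**2 - 1)
```

Calling `alpha(16)`, `alpha(32)` and `alpha(64)` should reproduce the values 0.673, 0.697 and 0.709 used in the practical program, and `beta(m)` should approach 1.03896 as m grows.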



4 Discussion 

We offer here final reflections concerning an implementation of the HyperLogLog algorithm (Figures 3, 
4, and 5) as well as some surrounding complexity considerations. 

The HyperLogLog program. A program meant to cope with most practical usage conditions is 
described in Figure 3. In comparison to the algorithm of Figure 2, one modification regarding initialization 
and two final corrections to the estimates are introduced. 

Let $h : \mathcal{D} \to \{0,1\}^{32}$ hash data from $\mathcal{D}$ to binary 32-bit words.
Let $\rho(s)$ be the position of the leftmost 1-bit of $s$: e.g., $\rho(1\cdots) = 1$, $\rho(0001\cdots) = 4$, $\rho(0^K) = K + 1$.

define $\alpha_{16} = 0.673$; $\alpha_{32} = 0.697$; $\alpha_{64} = 0.709$; $\alpha_m = 0.7213/(1 + 1.079/m)$ for $m \ge 128$;

Program HyperLogLog (input $\mathcal{M}$ : multiset of items from domain $\mathcal{D}$).
assume $m = 2^b$ with $b \in [4\,..\,16]$.
initialize a collection of $m$ registers, $M[1], \ldots, M[m]$, to 0;

for $v \in \mathcal{M}$ do
    set $x := h(v)$;
    set $j = 1 + \langle x_1 x_2 \cdots x_b \rangle_2$; {the binary address determined by the first $b$ bits of $x$}
    set $w := x_{b+1} x_{b+2} \cdots$;
    set $M[j] := \max(M[j], \rho(w))$;

compute $E := \alpha_m m^2 \cdot \Bigl(\sum_{j=1}^{m} 2^{-M[j]}\Bigr)^{-1}$; {the "raw" HyperLogLog estimate}

if $E \le \frac{5}{2} m$ then
    let $V$ be the number of registers equal to 0;
    if $V \ne 0$ then set $E^* := m \log(m/V)$ else set $E^* := E$; {small range correction}
if $E \le \frac{1}{30} 2^{32}$ then
    set $E^* := E$; {intermediate range: no correction}
if $E > \frac{1}{30} 2^{32}$ then
    set $E^* := -2^{32} \log(1 - E/2^{32})$; {large range correction}

return cardinality estimate $E^*$ with typical relative error $\pm 1.04/\sqrt{m}$.

Fig. 3: The HyperLogLog Program dimensioned for maximal cardinalities in the range $[0\,..\,10^9]$ and for common
"practical" values $m = 2^4, \ldots, 2^{16}$.

(i) Initialization of registers. In the algorithm of Figure 2, registers are initialized at $-\infty$. This has the
advantage of leading to expressions of the average case that are comparatively simple, as seen in the
equations of the preceding analysis. However, a consequence is that the estimate $E$ returned by the
algorithm assumes the value 0 as soon as one of the registers has been left untouched, that is, as soon
as one of the $m$ substreams is empty. Given known facts regarding the coupon collector problem, this
means that we should expect $E = 0$ when $n \ll m \log m$, so that the algorithm errs badly for small
cardinalities. In the program of Figure 3, we have changed the initialization of registers to 0. The
conclusions of Theorem 1 regarding the asymptotically unbiased character of the estimate are still
applicable to the program, since all substreams are nonempty with an overwhelming probability, as soon as
$n \gg m \log m$. The advantage of the modification is that we can now get usable estimates even when
$n$ is a small multiple of $m$ (this fact can be furthermore confirmed by Poisson approximations). The
estimates provided by the program for very small values of $n$ (say, $n$ a constant or $n = m$) can then
be effectively corrected, as we explain next.

(ii) Small range corrections. For the HyperLogLog program (including the modification of (i)
above, regarding register initialization), extensive simulations demonstrate that the asymptotic regime
is practically attained (without essentially affecting the nominal error of $1.04/\sqrt{m}$ and without
detectable bias) at the cardinality value $n = \frac{5}{2}m$, when $m \ge 16$. In contrast, for $n < \frac{5}{2}m$, nonlinear
distortions start appearing; on the extreme side, the raw algorithm with registers initialized to 0
will invariably return the estimate $\alpha_m m \approx 0.7m$ when $n = 0$ (!). Thus, corrections must be
brought to the estimate, when $E$ (i.e., $n$) is comparatively small with respect to $m$.

The solution comes from probabilistic properties of random allocations, as already exploited by the
Hit Counting algorithm of Whang et al. [24], whose analysis is discussed in [9, Sec. 4.3]. Say
$n$ balls are thrown at random into $m$ bins. Then, as is well known, the number of empty bins is
about $m e^{-\mu}$, where $\mu := n/m$. Thus, upon observing $V$ empty bins amongst a total of $m$, one
may legitimately expect $\mu$ to be close to $\log(m/V)$, that is, $n$ must be close to $m \log(m/V)$. (The
quality of this estimate can be precisely analysed, since exact and asymptotic forms are known for
the mean, variance, and distribution of $V$.) Here the bins are the ones associated
to the $m$ "submultisets", and one knows that a bin $j$ is empty from the fact that its corresponding
register $M[j]$ has preserved its initial value 0. This correction is incorporated in the program of
Figure 3.

(iii) Large range corrections. For cardinalities in the range $1\,..\,N$, with $N$ of the order of $10^9$, hashing
over at least $L = 32$ bits should be used ($2^{32} \approx 4 \cdot 10^9$). However, when the cardinality $n$
approaches (or perhaps even exceeds) $2^L$, then hashing collisions become more and more likely. For
a randomly chosen hash function, this effect can be modelled by a balls and bins model of the type
described in the previous paragraph, with now $2^L$ replacing $m$. In other words, the quantity $E$ of
HyperLogLog estimates the number of different hashed values, which is, with high probability,
about $2^L(1 - e^{-\lambda})$, where $\lambda = n/2^L$. The inversion of that relation then gives us the approximate
equation $n = -2^L \log(1 - E/2^L)$, which is the one used in the program.
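
The program of Figure 3, together with the three points above, can be sketched in a few lines of Python. This is an illustrative transcription, not the authors' code. The hash (the first four bytes of SHA-1 applied to `str(item)`) is an assumption standing in for any well-mixed 32-bit hash, and the three tests on $E$ are read as mutually exclusive ranges (small, intermediate, large), which is one interpretation of the figure.

```python
import hashlib
import math

HASH_BITS = 32  # L in the text


def hash32(item) -> int:
    """Assumed hash: first 4 bytes of SHA-1 of str(item), as a 32-bit integer."""
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def alpha(m: int) -> float:
    """Bias-correction constants as tabulated in Figure 3."""
    fixed = {16: 0.673, 32: 0.697, 64: 0.709}
    return fixed.get(m, 0.7213 / (1.0 + 1.079 / m))


def rho(w: int, width: int) -> int:
    """Position of the leftmost 1-bit in a width-bit word (width + 1 if w == 0)."""
    return width - w.bit_length() + 1


def hyperloglog(items, b: int = 10) -> float:
    """Estimate the number of distinct elements of `items` with m = 2**b registers."""
    if not 4 <= b <= 16:
        raise ValueError("b must lie in [4, 16]")
    m = 1 << b
    width = HASH_BITS - b
    registers = [0] * m                      # modification (i): initialize to 0
    for v in items:
        x = hash32(v)
        j = x >> width                       # first b bits: register address (0-based)
        w = x & ((1 << width) - 1)           # remaining bits
        registers[j] = max(registers[j], rho(w, width))

    raw = alpha(m) * m * m / sum(2.0 ** -r for r in registers)
    two_l = 2.0 ** HASH_BITS
    if raw <= 2.5 * m:                       # (ii) small range correction
        empty = registers.count(0)
        return m * math.log(m / empty) if empty else raw
    if raw <= two_l / 30.0:                  # intermediate range: no correction
        return raw
    # (iii) large range correction; assumes raw < 2**32
    return -two_l * math.log(1.0 - raw / two_l)


if __name__ == "__main__":
    n = 50_000
    estimate = hyperloglog((f"item-{i}" for i in range(n)), b=10)
    print(n, round(estimate), f"{(estimate - n) / n:+.2%}")
```

With $m = 1024$ the relative error of a run is expected to be of the order of $1.04/\sqrt{1024} \approx 3\%$, and feeding the same items several times leaves the registers, and hence the estimate, unchanged.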

Regarding registers, their values a priori range in the interval $0\,..\,L + 1 - \log_2 m$. With hashed values
of 32 bits, this means that 5 bits ("short bytes") are sufficient to store registers (of course, standard 8-bit 
bytes can also be used in some implementations). Regarding the quality of results returned, we expect 
the values of the estimate returned to be approximately Gaussian, due to an averaging effect and the 
Central Limit Theorem: this property is indeed well supported by the simulations of Figure [4] (bottom). 
Accordingly: 

Let $\sigma \approx 1.04/\sqrt{m}$ represent the standard error; the estimates provided by HyperLogLog
are expected to be within $\sigma$, $2\sigma$, $3\sigma$ of the exact count in respectively 65%, 95%, 99% of all
the cases.
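
The arithmetic behind register sizes and error bands is easy to tabulate. The helper names below are made up for this illustration; the formulas are the ones stated above (register range $0\,..\,L + 1 - \log_2 m$ and $\sigma \approx 1.04/\sqrt{m}$).

```python
import math


def register_bits(b: int, hash_bits: int = 32) -> int:
    """Bits needed per register when m = 2**b and hashed values have hash_bits bits."""
    max_value = hash_bits + 1 - b          # registers range over 0 .. L + 1 - log2(m)
    return max_value.bit_length()          # smallest k such that max_value < 2**k


def sketch_bytes(b: int, hash_bits: int = 32) -> int:
    """Total memory of the m registers, packed, in bytes."""
    return math.ceil((1 << b) * register_bits(b, hash_bits) / 8)


def standard_error(b: int) -> float:
    """Typical relative accuracy sigma ~ 1.04 / sqrt(m)."""
    return 1.04 / math.sqrt(1 << b)


def error_band(n: float, b: int, k: int) -> tuple:
    """Interval of estimates within k standard errors of the exact count n."""
    s = standard_error(b)
    return (n * (1 - k * s), n * (1 + k * s))


if __name__ == "__main__":
    for b in (4, 8, 11, 16):
        print(b, register_bits(b), sketch_bytes(b), f"{standard_error(b):.2%}")
    print(error_band(1_000_000, 11, 2))
```

For every $b$ in $[4\,..\,16]$ the register width comes out as 5 bits, which is the "short byte" of the text.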

In practice, the HyperLogLog program is quite efficient: like the standard LogLog algorithm or
MinCount of [16], its running time is dominated by the computation of the hash function, so that it is
only three to four times slower than a plain scan of the data (e.g., by the Unix command "wc -l", which
merely counts end-of-lines).

Optimality considerations. The near-optimality expressed by our title results from the combination of 
two facts. 

(i) Clearly, maintaining $\varepsilon$-approximate counts up to a range of $N$ necessitates $\Omega(\log\log N)$ bits. Indeed,
the cardinalities should be located in an exponential scale,

$$1,\ (1+\varepsilon),\ (1+\varepsilon)^2,\ \ldots,\ (1+\varepsilon)^L = N,$$

which comprises $\log_{1+\varepsilon} N$ intervals, necessitating at least $\log_2 \log_{1+\varepsilon} N$ bits of information to
be represented.

(ii) For a wide class of algorithms based on order statistics, Chassaing and Gerin [6] have shown that
the best achievable accuracy is bounded from below by a quantity close to $1/\sqrt{m}$. Our algorithm,
which can be viewed as maintaining approximate order statistics³, is, on the basis of this result, only
about 4% off the information-theoretic optimum of the Chassaing-Gerin class, while using memory
units that are typically 3 to 5 times shorter.

³ In effect, for a multiset $S$ of $[0,1]$-numbers, the quantity $2^{-\max_{x \in S} \rho(x)}$ is an approximation to $\min(S)$ up to a factor at most 2.

• Traces of four typical executions showing the evolution over ideal multisets of the relative error of the estimate
produced by the HyperLogLog program as a function of cardinality $n$, for $n < N$ and for various values of $(N, m)$:
(top left) $(10^4, 256)$; (top right) $(10^7, 1024)$; (bottom left) $(10^8, 1024)$; (bottom right) $(10^7, 65536)$. Values of $n$ are
plotted on a logarithmic scale. (The top and bottom horizontal lines represent the predicted standard error, namely,
±7% for $m = 256$, ±3% for $m = 1024$, and ±0.5% for $m = 65536$.)

• The empirical histogram of the estimates produced by the algorithm (based on 500 and 250 simulations, respectively) and an
approximate fit by a Gaussian curve for $(n, m) = (10^4, 256)$, left, or $(n, m) = (10^6, 1024)$, right.

Fig. 4: Simulations of the behaviour of the HyperLogLog program, including the low cardinality correction, on
ideal multisets (random uniform data).

Fig. 5: Empirical histogram of the quality of the estimates measured by the relative errors observed, as provided
by HyperLogLog on "real-life" files. The parameter is $m = 2048$, corresponding to a standard error of ±2%; the
experiment has been conducted on 458 chunks of about 40,000 lines each, obtained from octal dumps of the PostScript
source of a forthcoming book (Analytic Combinatorics by Flajolet and Sedgewick).
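
The counting argument of item (i) above can be made concrete with a small calculation. The function names and the sample values $N = 10^9$, $\varepsilon = 0.02$ are choices of this illustration only.

```python
import math


def scale_intervals(n_max: float, eps: float) -> float:
    """Number of steps log_{1+eps}(N) in the scale 1, (1+eps), (1+eps)^2, ..., N."""
    return math.log(n_max) / math.log1p(eps)


def bits_lower_bound(n_max: float, eps: float) -> float:
    """Bits needed just to name one interval of the scale: log2(log_{1+eps}(N))."""
    return math.log2(scale_intervals(n_max, eps))


if __name__ == "__main__":
    print(scale_intervals(1e9, 0.02))   # roughly a thousand intervals
    print(bits_lower_bound(1e9, 0.02))  # roughly ten bits
```

Locating a cardinality below $10^9$ to within 2% thus requires on the order of ten bits at the very least, independently of the algorithm used.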

As a final summary, the algorithm proves to be easy to code and efficient, being even nearly optimal
under certain criteria. On "real-life" data, it appears to be in excellent agreement with the theoretical
analysis, a fact recently verified by extensive tests (see Figure 5 for a sample) conducted by Pranav Kashyap,
whose contribution is here gratefully acknowledged. The program can be applied to very diverse
collections of data (only a "good" hash function is needed), and, once duly equipped with corrections, it can
smoothly cope with a wide range of cardinalities, from very small to very large. In addition, it parallelizes
or distributes optimally and can be adapted to the "sliding window" usage.
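
One standard way to realize the parallel or distributed usage, assumed here rather than spelled out in the passage above, is to build one register array per subfile and combine the arrays by a componentwise maximum. The sketch below checks that the merged registers coincide with those of a single pass over the union. The hash and the helper names are illustrative assumptions.

```python
import hashlib

HASH_BITS = 32


def hash32(item) -> int:
    """Assumed hash: first 4 bytes of SHA-1 of str(item), as a 32-bit integer."""
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def sketch(items, b: int = 8) -> list:
    """Registers M[0..m-1] of HyperLogLog (m = 2**b), initialized to 0."""
    width = HASH_BITS - b
    registers = [0] * (1 << b)
    for v in items:
        x = hash32(v)
        j = x >> width                         # register address from the first b bits
        w = x & ((1 << width) - 1)             # remaining bits
        rank = width - w.bit_length() + 1      # position of the leftmost 1-bit
        registers[j] = max(registers[j], rank)
    return registers


def merge(left: list, right: list) -> list:
    """Combine two register arrays built with the same b and the same hash."""
    if len(left) != len(right):
        raise ValueError("sketches must use the same number of registers")
    return [max(x, y) for x, y in zip(left, right)]


if __name__ == "__main__":
    part_a = sketch(range(0, 1000))
    part_b = sketch(range(500, 2000))          # overlaps with part_a
    print(merge(part_a, part_b) == sketch(range(0, 2000)))
```

Because each register only ever records a maximum, the merge is exact: overlapping subfiles and repeated items do not alter the result.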



All in all, HyperLogLog is highly practical, versatile, and it conforms well to what analysis predicts.